Analyzing Social Media Data to Understand Consumer Information Needs on Dietary Supplements

Despite the high consumption of dietary supplements (DS), few reliable, relevant, and comprehensive online resources satisfy information seekers. This study aims to understand consumer information needs on DS using topic modeling, and to evaluate how accurately topics can be identified from social media. We retrieved 16,095 unique questions posted on Yahoo! Answers that mention 438 unique DS ingredients in the “Alternative medicine” sub-section of the “Health” section. We implemented an unsupervised topic modeling method, Correlation Explanation (CorEx), to unveil the topics in which consumers are most interested. We manually reviewed the keywords of all 200 topics generated by CorEx and assigned them to 38 health-related categories, corresponding to 12 higher-level groups. We found high accuracy (90–100%) in identifying questions that correctly align with the selected topics. The results could guide the development of a more comprehensive and structured DS resource based on consumers’ information needs.


This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).

Introduction
Dietary supplement (DS) usage has gained popularity in recent years, with almost 52% of U.S. adults reporting the use of one or more supplements [1]. This high DS usage is especially common among adults aged ≥60 years, of whom 70% have reported using one or more DS [2]. Despite this escalating trend in DS consumption across a wide range of consumers, few online resources offer DS information that is personalized, reliable, succinct, up-to-date, and written in language that a layperson can easily understand.
In recent years, the internet has emerged as an important source of health-related information, giving people an opportunity to search online for free health information. According to a Pew Research Center report, 80% of internet users have looked for health information online [3,4]. This is especially likely in the case of DS, as their use is primarily self-initiated rather than based on clinicians' recommendations [5]. Existing online DS health information resources in the U.S. range from open access, publicly available databases, e.g., the Food and Drug Administration (FDA) [6], the Office of Dietary Supplements (ODS) [7], and the Dietary Supplement Label Database (DSLD) [8], to commercial databases that often require a paid subscription, e.g., Natural Medicines (NM) [9]. Personalized queries from consumers are often consolidated under online resources such as "Frequently Asked Questions". However, the information disseminated through such resources is often very basic, non-specific, and not very helpful.
The rapid growth of digital data, especially in the healthcare domain, offers great opportunities for secondary use in clinical research. Topic modeling [10] has been an area of great interest, and several studies have applied this methodology to electronic data. Its growing popularity stems from its ability to reveal the latent structure and groupings of the underlying corpus without any prerequisite knowledge. Applications of topic modeling in healthcare research include: analyzing clinical notes from Electronic Health Record (EHR) data; discovering and understanding health care trajectories [11]; identifying medication prescribing patterns [12]; mining adverse events of DS from product labels [13]; and discovering health topics in social media [14,15], among various others.
Various social Questions and Answers (Q&A) sites and online forums within health communities, e.g., Yahoo! Answers, allow one to seek information by posting questions and receiving answers from other users (e.g., consumers, health professionals) [16]. Previously, we have used Yahoo! Answers data in several studies, e.g., to investigate the terminology and language gap between health consumers and health professionals [17]; to mine consumer-friendly medical terms to enrich consumer health vocabulary [16]; and to understand the information needs of diabetes patients about their laboratory results [18].
The purpose of this research study is to understand the information needs of DS consumers by analyzing questions coming directly from consumers and in their own language. To this end, we applied Correlation Explanation (CorEx), a topic modeling algorithm, to the title and body of each question under the Q&A section of the Yahoo! Answers database in order to unveil the "topics" around DS information needs. We generated a list of coherent topics that represent the areas of DS-related information and associated DS ingredients that consumers are most interested in. We also evaluated the accuracy of the CorEx method in correctly identifying the topics from social media. In the future, the knowledge gained from this study could be used as a guide for developing more meaningful DS resources for consumers that are better aligned with their information needs.
Figure 1 illustrates the overview of the methods. We extracted and pre-processed questions retrieved from the Yahoo! Answers database, focusing on questions around DS. We performed topic modeling using CorEx in order to understand the DS-related topics and categories that consumers are most interested in. We then evaluated the accuracy of the topic modeling methodology by manually reviewing a subset of top-ranked questions. We further investigated the actual DS ingredients associated with all the questions under each topic.

Collecting and Pre-processing Data
We collected a total of 2,820,179 Yahoo! Answers questions and the corresponding answers posted under 21 sub-categories belonging to the main category "Health". We further extracted 112,090 questions (including their titles and contents) from one of the sub-categories, "Alternative Medicine". We then matched the preferred DS names in "iDISK", the first Integrated Dietary Supplements Knowledge base, where DS-related information is represented in a comprehensive and standardized form [19], against the DS ingredient names in the questions. After two assessors (YW, RR) had manually reviewed the matched preferred names, we cleaned up the DS ingredient name list based on the following rules: 1) only including ingredients with more than 5 matched questions; 2) excluding commonly consumed everyday food/drink items, e.g., fruits, vegetables, wine, caffeine, and water; 3) excluding body parts, e.g., adrenal cortex, brain, and stomach; and 4) excluding recreational drugs, e.g., marijuana, poppy seed. Only the questions that exactly matched the DS ingredient names on this list were kept.
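The matching and list-cleaning rules above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `match_ingredients`, the `excluded` set, and the use of case-insensitive whole-word matching are all assumptions introduced here, and the manual review step is not modeled.

```python
import re
from collections import defaultdict

def match_ingredients(questions, ingredient_names, excluded=frozenset(), more_than=5):
    """Map each kept ingredient name to the ids of questions that mention it.

    questions: dict of question id -> text (title plus body).
    excluded: lower-cased names dropped by rules 2-4 (food, body parts, drugs).
    more_than: rule 1, keep ingredients with more than this many matched questions.
    """
    # Assumption: "exact match" is approximated by whole-word, case-insensitive search.
    patterns = {
        name: re.compile(r"\b" + re.escape(name.lower()) + r"\b")
        for name in ingredient_names
        if name.lower() not in excluded
    }
    matches = defaultdict(set)
    for qid, text in questions.items():
        lowered = text.lower()
        for name, pattern in patterns.items():
            if pattern.search(lowered):
                matches[name].add(qid)
    return {name: ids for name, ids in matches.items() if len(ids) > more_than}
```

With the default `more_than=5`, an ingredient needs at least six matched questions to survive, which mirrors rule 1; the questions to keep are then the union of the returned id sets.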
These questions were further pre-processed by subject matter experts (TN, JV) and used for topic modeling. We removed all ingredient mentions within the questions in order to understand information needs that are not specific to a particular DS. Each question was then lower-cased and tokenized. Special characters, hyperlinks, and common stop-words (e.g., 'I', 'you') were removed, and each word was normalized using the normalized string generator (Norm) from the SPECIALIST NLP tools [20]. We only considered words that had at least 3 characters, since shorter words were usually not meaningful. We also removed words that occurred fewer than five times, or more than 85% of the time, as they might not contribute much to the question.
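A rough sketch of these pre-processing steps is given below. It is illustrative only: the stop-word list is a tiny placeholder, the Norm lexical normalization step is not reproduced, and "more than 85% of the time" is interpreted here as "appears in more than 85% of questions", which is an assumption. The names `tokenize` and `filter_vocabulary` are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"i", "you", "the", "and", "for", "is", "it", "my", "to", "a"}  # placeholder list
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")  # drops digits and special characters

def tokenize(text, ingredient_names=()):
    """Lower-case, strip hyperlinks and ingredient mentions, keep words of 3+ letters."""
    text = URL_RE.sub(" ", text.lower())
    for name in ingredient_names:
        text = re.sub(r"\b" + re.escape(name.lower()) + r"\b", " ", text)
    return [t for t in TOKEN_RE.findall(text) if len(t) >= 3 and t not in STOP_WORDS]

def filter_vocabulary(tokenized_docs, min_count=5, max_doc_share=0.85):
    """Drop words occurring fewer than min_count times or in too large a share of questions."""
    if not tokenized_docs:
        return []
    total = Counter(t for doc in tokenized_docs for t in doc)
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    n_docs = len(tokenized_docs)
    keep = {
        t for t in total
        if total[t] >= min_count and doc_freq[t] / n_docs <= max_doc_share
    }
    return [[t for t in doc if t in keep] for doc in tokenized_docs]
```

The filtered token lists can then be turned into a binary question-by-word matrix, which is the usual input format for CorEx topic models.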

Identifying Topics for DS Questions
In our preliminary investigation of different topic modeling strategies, we found that Correlation Explanation (CorEx) [21] discovered more coherent topics than Latent Dirichlet Allocation (LDA). In contrast to LDA, which defines a generative model for inferring topics, CorEx discovers topics by maximizing the mutual information between words and topics. A subjective assessment of topic quality was performed by two assessors, who are co-authors and subject matter experts (YW, RR). A topic was considered "coherent" if the assessors found a clear semantic criterion that unites the words under that topic. We evaluated CorEx models with different numbers of topics (n = 100, 150, 200, 250) and found that the model with 200 topics yielded the most coherent topic categories.
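To make the quantity being maximized more concrete, the following toy sketch computes the mutual information between a binary word indicator and a binary topic label over a set of questions. It only illustrates the measure itself under the assumption that topic labels are already known; CorEx learns latent topics and optimizes a related objective over groups of words, which this snippet does not attempt. The function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi
```

A word whose presence perfectly tracks a topic carries one bit of information about it in the binary case, while a word distributed independently of the topic carries none; informative words are therefore the ones that end up defining a topic.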
The topics of the selected model were then reviewed and named by mutual agreement between two assessors (YW, RR). The "topics" with similar themes were merged into "categories" (e.g., gastrointestinal disorders, psychiatric disorders), which were further condensed into higher-level "groups" (e.g., "uses and symptoms"). For the group "uses and symptoms", we used the System Organ Class (SOC) level of the Medical Dictionary for Regulatory Activities (MedDRA), a medical terminology used to classify adverse event information associated with the use of biopharmaceuticals and other medical products [22].

Topic Evaluation
To evaluate the accuracy of the topic modeling, we selected 15 topics and, for each, extracted the 10 questions with the highest ranked probabilities. Two reviewers (RR, YW) manually determined whether the extracted questions correctly aligned with the topics generated by the topic modeling method described above. Correctness was reported as percentage accuracy. We also extracted the DS ingredient names corresponding to each topic in order to explore the distribution of ingredient names across topics, and reported the DS ingredients associated with the most questions for selected topics.
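The evaluation procedure amounts to ranking questions by their probability under a topic, taking the top 10, and computing the share judged correct by the reviewers. A minimal sketch, with hypothetical function names and with the manual judgments supplied as booleans:

```python
def top_k_questions(topic_probs, k=10):
    """Return the ids of the k questions with the highest probability for one topic.

    topic_probs: dict of question id -> probability of belonging to the topic.
    """
    ranked = sorted(topic_probs.items(), key=lambda item: item[1], reverse=True)
    return [qid for qid, _ in ranked[:k]]

def percent_accuracy(judgments):
    """Percentage of reviewed questions judged to align with the topic.

    judgments: list of booleans, one per manually reviewed question.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    return 100.0 * sum(judgments) / len(judgments)
```

With 10 reviewed questions per topic, each misaligned question lowers the reported accuracy by 10 percentage points, so the measure is coarse and best read alongside the reviewed questions themselves.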

Question Data and Topic Analysis
The final list consisted of 438 unique DS terms associated with 16,095 unique matching questions. After data pre-processing, our corpus contained a total of 213,790 tokens, made up of 7,164 unique words.
From the 200 topics generated by the CorEx modeling method, the domain experts (RR, YW) identified topics with similar themes and classified them into 38 unique categories (Table 1). The 38 unique categories were further summarized into the following 12 higher-level groups: use and adverse effects, product-related, healthy lifestyle, information resources/scientific evidence, addiction, time of use qualifier, sleep disorder, interventions, adverse effect in general, health benefits, mind and body, and population qualifier. The distribution of higher-level groups and the number of their associated categories is shown in Figure 2.
After evaluating the top 10 ranked questions for selected topics, we reported accuracy as the number and percentage of questions that correctly align with the generated topic. Table 1 lists examples of selected topic groups and their associated categories, along with the top 15 most probable words and common ingredient mentions.
We found that the percent accuracy for most of the selected topics is between 90% and 100%, except for the sleep (80%) and frequency/time (70%) categories. "Use and adverse effects" is the most dominant topic group and accounted for 50 topics out of 200. Under this group, there were 15 categories classified based on MedDRA SOC (Figure 3).

Dietary Supplements Associated with Most Questions
We also extracted the DS ingredient names associated with the most questions under each topic in order to explore the distribution of the most commonly discussed ingredients. Only DS ingredients associated with ≥10% of the questions under a specific topic were reported (Table 2).
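The ≥10% reporting threshold can be expressed as a simple share computation per topic. The sketch below assumes that each question under a topic is represented by the set of DS ingredient names it mentions; the function name and this data layout are illustrative assumptions.

```python
from collections import Counter

def dominant_ingredients(question_ingredients, min_share=0.10):
    """Return ingredient -> share of a topic's questions, for shares >= min_share.

    question_ingredients: one collection of ingredient names per question in the topic.
    """
    n = len(question_ingredients)
    if n == 0:
        return {}
    # Count each ingredient at most once per question.
    counts = Counter(name for names in question_ingredients for name in set(names))
    return {name: c / n for name, c in counts.items() if c / n >= min_share}
```

Because a question can mention several ingredients, the shares for one topic need not sum to 1.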

Discussion
In this study, we applied CorEx topic modeling to user-generated questions from Yahoo! Answers in order to better understand the information needs of consumers. We also discovered interesting information on the distribution of DS ingredients across topics of special interest to consumers. This research effort further supports the feasibility of using topic modeling to extract important information hidden in a large corpus of social media data.
Applying CorEx topic modeling, we were able to identify 12 topic groups. The three groups with the largest number of assigned categories and topics, which can be regarded as the information most sought by consumers, are "use and adverse effects", "product-related", and "healthy lifestyle" (Figure 2). Extracted information pertaining to any symptom or sign (e.g., diarrhea, abdominal pain, palpitations, headaches) could be either an indication or an adverse event of a DS; therefore, uses and adverse effects were combined into one group, "use and adverse effects". Within this group, we found a higher number of topics and associated questions concerning the gastrointestinal system (specifically diarrhea and constipation), psychiatric disorders (mainly anxiety and depression), and skin and subcutaneous tissues (primarily acne and UV protection). We also had a "mixed group", with keywords corresponding to more than one system. For the "product-related" group, we merged categories such as dose, dose form, and preparation because of their co-occurrence under one topic (e.g., Topic #43). Under the "healthy lifestyle" group, the topics were mostly around eating healthy and weight control/exercise.
We found high accuracy when we identified questions that correctly align with the topic categories/groups (Table 1). The few topics with lower matching accuracy also contained questions related to other topics, e.g., the sleep disorders topic included questions related to recreational drugs and anxiety/depression. We also reported the DS ingredient names associated with the most questions for a particular topic (Table 2). We found a substantially higher percentage of questions for the ingredients "Honey" under respiratory disorders and "Melatonin" under sleep disorders. This information provides essential knowledge on the use of DS for various specific reasons and needs further exploration.
This research study had several limitations. We analyzed only questions belonging to the alternative medicine sub-category under the "health" section and might have missed dietary supplement occurrences under other sub-categories, e.g., mental health conditions, general health care. We used only preferred DS ingredient names, and not their synonyms (e.g., scientific names, common names), to extract the corresponding questions. Also, there are inherent limitations to topic modeling, e.g., topics are generated based on the statistical distribution of words within the questions, and thus some topics with incoherent keywords were also generated.

Conclusions
This research provides essential insights into extracting and understanding the information needs of consumers around dietary supplements using CorEx-based topic modeling, which identified the relevant topics embedded in a large corpus of Yahoo! Answers data with high accuracy. The knowledge gained here could be used to generate a more comprehensive repository of resources for consumers around dietary supplement usage. This study thus contributes to demonstrating the potential benefits of using social media data in clinical research.